-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the structure of images in PixtralProcessor #35107
Conversation
cc @yurkoff-mv, can you try out this PR and confirm it resolves the issues you were having? You can use it with |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
also cc @zucchini-nlp and @ArthurZucker / @LysandreJik for core maintainer review |
@Rocketknight1, thank you for responding so quickly. Yes, these fixes work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, the new errors have to be tested and the doc should mention this to remove confusion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am wondering, is there any specific reason to support such a complex processing if at the end we end up flattening it in the modeling code. I think we should support both flat list/batch list as inputs similar to most other VLMs without trying to batch-list inputs before passing them to processors
It can help us in the long run to get standardized processors
Hi @zucchini-nlp, how do the inputs differ here compared to most other processors? I thought |
@Rocketknight1 usually we support both formats, as |
Closing because this has been included in #34801 instead |
The
PixtralProcessor
has some issues regarding correct nesting depth of inputs. If we are fully explicit about input nesting, thentext
should beList[str]
andimages
should beList[List[Image]]
. This is because each sample only has one text, but can have multiple images.The start of the Pixtral processor code handles cases when users don't supply fully nested inputs. However, it uses the heuristic that if the user passes
List[str]
fortext
andList[Image]
forimages
, then this indicates one image per sample. This is very confusing for users, especially because it means output changes whenbatch_size==1
, depending on whether that single input is passed asstr
or[str]
.With this PR, we avoid that assumption. If the user supplies a single image or flat list of images, then we assign them all to the first sample in the event that
batch_size==1
. Ifbatch_size>1
then we throw an error, asking them to supply an explicit list of lists instead. This seems much safer!